Contents

Setup

Install and load package

if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("shbrief/PCAGenomicSignatures")
library(PCAGenomicSignatures)

Download PCAmodel

Currently, you can download PCAGenomicSignatures from Google Cloud bucket using PCAGenomicSignatures::getModel function. This model is built from top 20 PCs of 536 studies (containing 44,890 samples) containing 13,934 common genes from each of 536 study’s top 90% varying genes based on thier study-level standard deviation. There are two versions of this using different gene sets for GSEA-based annotation; MSigDB C2 (C2) and three priors from PLIER package (PLIERpriors). In this vignette, we are using the C2 annotated model.

wd <- getwd()
fpath <- file.path(wd, "PCAmodel_C2.rds")
if (!file.exists(fpath)) {getModel("C2", dir = wd)} 

PCAmodel <- readRDS(file.path(wd, "PCAmodel_C2.rds"))
PCAmodel
#> class: PCAGenomicSignatures 
#> dim: 13934 4764 
#> metadata(6): cluster size ... MeSH_freq updateNote
#> assays(1): model
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): PCcluster1 PCcluster2 ... PCcluster4763 PCcluster4764
#> colData names(4): PCcluster studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526

Example dataset

Human B-cell expression dataset The human B-cell dataset (Gene Expression Omnibus series GSE2350) consists of 211 normal and tumor human B-cell phenotypes whose expression was profiled on Affymatrix HG-U95Av2 arrays, and it is contained in an ExpressionSet object with 6,249 features x 211 samples.

if (!"bcellViper" %in% rownames(installed.packages())) {
  BiocManager::install("bcellViper")
}

suppressPackageStartupMessages(library(bcellViper))
data(bcellViper)
dset
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 6249 features, 211 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: GSM44075 GSM44078 ... GSM44302 (211 total)
#>   varLabels: sampleID description detailed_description
#>   varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> Annotation:
dataset <- exprs(dset)   # genes in SYMBOL

You can provide your own expression dataset in any of these formats: simple matrix, ExpressionSet, or SummarizedExperiment. Just make sure that genes in rows are in a ‘symbol’ format.

Validate

HeatmapTable

heatmapTable outputs a two panel table: top panel represents average silhouette width (avg.sw) and the bottom panel represents the validation score.

You can display the validation output in multiple ways. For example, if you specify scoreCutoff argument of heatmapTable, any validation result above that score will be shown. If you specify the number of top validation results (= n) through num.out argument of heatmapTable, the output will be a n-columned heatmap table. You can also use the average silhouette width (swCutoff), the size of cluster (clsizecutoff), PC from dataset (whichPC).

Here, we print out top 5 validated PCclusters with > 0 average silhouette width.

val_all <- validate(dataset, PCAmodel)  
heatmapTable(val_all, num.out = 5, swCutoff = 0)

Interactive Graph

Under the default condition, plotValidate plots all non single-element clusters’ validation results in a single graph, where x-axis represent average silhouette width of the PCclusters (a quality control measure of the signature) and y-axis represent validation score. We recommend users to focus on PCclusters with higher validation score and use avgerage silhouette width as a secondary criteria.

plotValidate(val_all, interactive = FALSE)

plotValidate(val_all, interactive = TRUE, minClusterSize = 4)


You can hover each data point for more information:
- sw : the average silhouette width of the clutser
- score : the top validation score between 8 PCs of the dataset and the cluster
- cl_size : the size of the cluster, represented by the dot size
- cl_num : the PCcluster number. You need this index to find more information about the cluster.
- PC : Test dataset’s PC number that validates the given PCcluster. Because we used top 8 PCs of the test dataset, there are 8 categories.

If you double-click the PC legend on the right, you will enter an individual display mode where you can add an additional group of data point by single-click.

MeSH terms in wordcloud

You can draw a wordcloud with the enriched MeSH term of PCclusters that validate your dataset. Besed on the heatmap table above, 1st-3rd PCclusters (2538, 1139, 884) show high validation scores with positive average silhouette widths, so we draw wordclouds of those PCclusters using drawWordcloud function. You need to provide PCAmodel and the index of the PCcluster you are interested in.

Index of validated PCclusters can be easily collected using validatedSingatures function, which outputs the validated index based on num.out, PC from dataset (whichPC) or any *Cutoff arguments in a same way as heatmapTable.

validatedSig <- validatedSignatures(val_all, num.out = 3, swCutoff = 0)
validated_ind <- validatedSig[,"cl_num"]

set.seed(1)
drawWordcloud(PCAmodel, validated_ind[1])

drawWordcloud(PCAmodel, validated_ind[2])

drawWordcloud(PCAmodel, validated_ind[3])

GSEA

Annotation on PCcluster1139

Because the test dataset is human B-cell expression data, we tried the model annotated with blood-associated gene sets.

wd <- getwd()
fpath <- file.path(wd, "PCAmodel_PLIERpriors.rds")
if (!file.exists(fpath)) {getModel("PLIERpriors", dir = wd)} 

PCAmodel <- readRDS(file.path(wd, "PCAmodel_PLIERpriors.rds"))
PCAmodel
#> class: PCAGenomicSignatures 
#> dim: 13934 4764 
#> metadata(6): cluster size ... MeSH_freq updateNote
#> assays(1): model
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): PCcluster1 PCcluster2 ... PCcluster4763 PCcluster4764
#> colData names(4): PCcluster studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526